Authors


Hector R. Gavilanes Chief Information Officer
Gail Han Chief Operating Officer
Michael T. Mezzano Chief Technology Officer


University of West Florida

November 2023

Agenda

  • Introduction
  • Method
  • Example
  • Application
  • Discussion

Principal Component Analysis (PCA)

  • Unsupervised Machine Learning
  • Dimensionality Reduction Technique
  • Data Exploration
  • Feature Extraction
  • Data Visualization
  • Simplification of complex dataset
  • Principal Components (PCs): New variables that capture the variance present in the original variables
  • Mitigate multicollinearity

Multivariate Approaches

Methods

  • Data matrix \(X\) of size \(N \times P\).
  • Data is linearly related.
  • Continuous and normally distributed data.
    • In practice, the initial data distribution matters little; PCA does not strictly require normality.
  • Variables are similar in scale and without extreme outliers.
  • Missing data: Imputation or removal of observations.
  • Centering and scaling: Transform variables to a mean of 0 and a standard deviation of 1. \[ z_{np} = \frac{x_{np} - \bar{x}_{p}}{\sigma_{p}} \]
  • Covariance: A measure of how two random variables vary together. \[ Cov(x,y) = \frac{\Sigma(x_i-\bar{x})(y_i-\bar{y})}{N} \]
  • Covariance Matrix: Symmetric \(p \times p\) matrix which gives the covariance values for each pair of variables in the dataset.
  • Eigenvector: A nonzero vector whose direction is unaffected by a linear transformation.
  • An eigenvector is scaled by a factor \(\lambda\), its eigenvalue.
  • Each principal component is given by the eigenvectors of the covariance matrix.
    • The eigenvectors represent the directions of the new principal axes.
    • The eigenvalues represent the variance captured along each of these axes.
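As a minimal sketch of the steps above (toy data, base R only), the eigendecomposition of the covariance matrix of standardized data yields the principal axes and the variance along each:

```r
# Toy data: 3 variables, 10 observations, with variable 2 correlated to variable 1
set.seed(1)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
X[, 2] <- X[, 1] + 0.1 * X[, 2]

S <- cov(scale(X))   # covariance matrix of the centered, scaled data
e <- eigen(S)        # eigendecomposition

e$vectors   # columns: directions of the new principal axes
e$values    # variance captured along each axis, in decreasing order
```

Because the data were standardized, the eigenvalues sum to the number of variables (the trace of the correlation matrix).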

Finding the Principal Components

  • Find the linear combination of the columns of \(X\) (the variables) which maximizes variance.
  • Let \(a\) be a vector of constants \(a_1, a_2, a_3, …, a_p\) such that \(Xa\) represents the linear combination which maximizes variance.
  • The variance of \(Xa\) is represented by \(var(Xa) = a^TSa\) with the covariance matrix \(S\).
  • Finding the \(Xa\) with maximum variance equates to finding the vector \(a\) which maximizes the quadratic \(a^TSa\), where \(a^Ta = 1\).
  • \(a\) is a unit-norm eigenvector with eigenvalue \(\lambda\) of the covariance matrix \(S\).
  • The largest eigenvalue of \(S\) is \(\lambda_1\), with eigenvector \(a_1\). For any unit-norm eigenvector \(a\) with eigenvalue \(\lambda\): \[ var(Xa) = a^TSa = \lambda a^Ta = \lambda \]
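The identity \(var(Xa) = \lambda\) can be checked numerically; this sketch uses simulated data and the leading eigenvector of the sample covariance matrix:

```r
# Centered toy data: 50 observations of 4 variables
set.seed(2)
X <- scale(matrix(rnorm(200), nrow = 50, ncol = 4), scale = FALSE)
S <- cov(X)

eig <- eigen(S)
a1      <- eig$vectors[, 1]   # unit-norm eigenvector for the largest eigenvalue
lambda1 <- eig$values[1]

# var(Xa) = a'Sa = lambda for a unit-norm eigenvector a
var(as.vector(X %*% a1))      # equals lambda1 (up to floating point)
```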

Principal Components

  • Impose the restriction of orthogonality on the coefficient vectors (the eigenvectors of \(S\)).
    • Ensure the principal components are uncorrelated.
  • Each \(Xa_k\) is a principal component of the dataset, with eigenvector \(a_k\) and eigenvalue \(\lambda_k\).
  • Factor scores: the elements of \(Xa_k\).
    • How each observation scores on a PC.
    • Represent the projection of the original observations onto the PCs.
  • Loadings: the elements of the eigenvectors \(a_k\).
    • Represent the weights of the original variables in the computation of the PCs.
  • Eigenvectors: Represent directions of maximum variance.
  • Eigenvalues: Indicate the variance explained by each eigenvector.
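In R, prcomp() returns both quantities directly: the loadings as `rotation` and the factor scores as `x`. A sketch on toy data, verifying that the scores are the standardized data projected onto the loadings:

```r
set.seed(3)
X <- matrix(rnorm(100), nrow = 20, ncol = 5)
p <- prcomp(X, center = TRUE, scale. = TRUE)

p$rotation   # loadings: eigenvectors a_k (weights of the original variables)
p$x          # factor scores: projections Xa_k of the observations

# Scores equal the centered/scaled data multiplied by the loadings
Z <- scale(X)
max(abs(Z %*% p$rotation - p$x))   # ~ 0
```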

Example

  • Abalone dataset from the UCI Machine Learning Repository.
  • 4177 observations of 9 variables which record characteristics of each abalone including sex, length, diameter, height, weights, and the number of rings.
  • The variables, apart from sex, are continuous and correlated.

Preprocessing the data

  • Exclude non-numeric variables from the dataset.
    • The variable Sex is excluded.
  • Check for missing data.
    • No missing data in the dataset.
  • Scale and center the data.
  • Check for and handle extreme outliers.
    • Outliers do not present a large problem.
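The preprocessing steps above can be sketched in base R. The data frame below is a small stand-in for the abalone data (the real dataset has 4177 rows and 9 columns):

```r
# Stand-in for the abalone data frame
abalone <- data.frame(Sex      = c("M", "F", "I", "M"),
                      Length   = c(0.455, 0.350, 0.530, 0.440),
                      Diameter = c(0.365, 0.265, 0.420, 0.365),
                      Rings    = c(15, 7, 9, 10))

num <- abalone[, sapply(abalone, is.numeric)]  # exclude non-numeric variables (Sex)
stopifnot(!anyNA(num))                         # check for missing data
num_scaled <- scale(num)                       # center to mean 0, scale to sd 1
round(colMeans(num_scaled), 10)                # all ~ 0
```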

Perform Principal Component Analysis

The prcomp() function performs principal component analysis via singular value decomposition of the centered (and optionally scaled) data matrix, which is equivalent to an eigendecomposition of the covariance matrix.

  • The standard deviation for each PC represents the information captured by that principal component.
  • The proportion of variance is the percent of total variance captured by each PC.
  • The cumulative proportion gives the total variance captured by the PC and all prior PCs.
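A short usage sketch on simulated data; summary() reports exactly the three quantities listed above:

```r
set.seed(5)
d <- as.data.frame(matrix(rnorm(150), nrow = 30, ncol = 5))
p <- prcomp(d, center = TRUE, scale. = TRUE)

s <- summary(p)
s$importance   # rows: Standard deviation, Proportion of Variance, Cumulative Proportion

# Proportion of variance for PC1 = sdev^2 / total variance
p$sdev[1]^2 / sum(p$sdev^2)
```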

Visualizing the results

Interpreting the results

  • The loadings of the first two principal components show the contribution of each variable to PC1 and PC2.

Dataset - State Averages of Common Dialysis Quality Measures

  • Source: Center for Medicare & Medicaid Services (Data.CMS.gov)
  • Data collected from 50 U.S. states + 6 U.S. territories
  • 39 variables
    • 24 measures of patient care quality in dialysis facilities
    • 14 characteristics of dialysis patients
    • Response variable: As Expected Survival
    • Index variable: a categorical identifier, removed before analysis

Dataset Summary

Dataset Selection Rationale

  • Selection driven by multicollinearity among the measures.

  • Some features are less significant in explaining variability.

  • All variables are numeric.

  • One categorical Index variable (removed).

Data Preparation

  • Removal of stray white space throughout the dataset.

  • Editing variable names to enhance readability and meaning.

Original: “Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL.”

Edited: “hypercalcemia_calcium > 10.2Mg.”
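A minimal sketch of both cleanup steps in base R, using the column name shown above and hypothetical values:

```r
# Hypothetical raw column, named as imported from the CMS file
raw <- "Percentage.Of.Adult..Patients.With.Hypercalcemia..Serum.Calcium.Greater.Than.10.2.Mg.dL."
df <- data.frame(x = c(" 1.2", "3.4 "))
names(df) <- raw

df[] <- lapply(df, trimws)                      # strip stray white space in values
names(df) <- "hypercalcemia_calcium > 10.2Mg."  # shorter, readable name
names(df)
```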

Missing Values

  • 34 missing values.

  • Imputation of missing values using the mean (\(\mu\)) of each feature.
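Mean imputation replaces each missing value with the mean of the observed values in that feature; a one-variable sketch:

```r
x <- c(2, NA, 4, 6, NA)
x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace each NA with the feature mean
x                                      # 2 4 4 6 4
```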

Distribution

  • Normality is not assumed.

QQ-Plot of Residuals

  • Outliers are present throughout the entire dataset

Standardization

  • Mean (\(\mu\)=0); Standard Deviation (\(\sigma\)= 1)

    \[ Z = \frac{x - \mu}{\sigma} \]

    \[ Z \sim N(0,1) \]
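The transformation above, sketched on a small numeric vector:

```r
x <- c(10, 12, 9, 14, 15)
z <- (x - mean(x)) / sd(x)     # z-scores

c(mean = mean(z), sd = sd(z))  # ~ 0 and 1
```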

Outliers & Leverage

  • 3 Outliers

  • No leverage

  • Minimal difference in results with or without them.

  • No observations removed.

Correlations

  • Multicollinearity is present.
  • Threshold = 0.30.
  • 28 Correlated features.
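Flagging correlated feature pairs against a 0.30 threshold can be sketched as follows (simulated stand-in data; the variable names `a`, `b`, `c` are illustrative only):

```r
set.seed(9)
n <- 56                          # 50 states + 6 territories
base <- rnorm(n)
d <- data.frame(a = base + rnorm(n, sd = 0.3),   # a and b share a common signal
                b = base + rnorm(n, sd = 0.3),
                c = rnorm(n))                    # c is independent

cm <- cor(d)
# Feature pairs whose |correlation| exceeds the 0.30 threshold
high <- which(abs(cm) > 0.30 & upper.tri(cm), arr.ind = TRUE)
nrow(high)
```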

Scree Plot

  • PC1 explains 40.8% of the variance.
  • PC2 explains 9.5% of the variance.

BiPlot

  • PC1 (black) shows the longest projection distances, as expected for the direction of maximum variance.
  • PC2 (blue) shows shorter projection distances, as expected.

Contribution of Variables

Modeling

# Required packages: caTools (sample.split) and caret (preProcess)
library(caTools)
library(caret)

# Reproducible random sampling
set.seed(my_seed)

# Target y-variable used to stratify the split
y <- train_data$expected_survival
# Split the data into training and test sets
split <- sample.split(y, SplitRatio = 0.7)
training_set <- subset(train_data, split == TRUE)
test_set <- subset(train_data, split == FALSE)

# Center and scale the predictors; the scaler is fit on the
# training set only, then applied to both sets
sc <- preProcess(training_set[, -target_index],
                 method = c("center", "scale"))
training_set[, -target_index] <- predict(
  sc, training_set[, -target_index])
test_set[, -target_index] <- predict(sc, test_set[, -target_index])

# Fit PCA on the training predictors, keeping 8 components
pca <- preProcess(training_set[, -target_index],
                  method = "pca", pcaComp = 8)

# Apply the PCA transformation to the training set
training_set <- predict(pca, training_set)

# Reorder columns, moving the dependent variable to the end
training_set <- training_set[c(2:9, 1)]

# Apply the same PCA transformation to the test set
test_set <- predict(pca, test_set)
test_set <- test_set[c(2:9, 1)]

Uncorrelated Matrix

8 Principal Components

PC Regression

8 Components

2 Components
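Principal component regression fits a linear model on the component scores rather than the original variables. A sketch comparing an 8-component fit with a 2-component fit; the data frame below is a simulated stand-in for the PCA-transformed training set (`PC1`…`PC8` plus `expected_survival` are assumed names):

```r
set.seed(10)
# Simulated stand-in for the PCA-transformed training set
train_pca <- as.data.frame(matrix(rnorm(400), nrow = 50, ncol = 8))
names(train_pca) <- paste0("PC", 1:8)
train_pca$expected_survival <- 2 * train_pca$PC1 - train_pca$PC2 + rnorm(50, sd = 0.1)

# Regression on all 8 components vs. only the first 2
fit8 <- lm(expected_survival ~ ., data = train_pca)
fit2 <- lm(expected_survival ~ PC1 + PC2, data = train_pca)

summary(fit8)$r.squared
summary(fit2)$r.squared
```

Adding components never lowers in-sample \(R^2\), so the comparison of interest is predictive performance, as in the cross-validation model below.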

Cross-Validation Model

# Cross-validation with 10 folds (uses the caret package)
library(caret)
k_10 <- trainControl(method = "cv", number = 10)

# training the model 
model_cv <- train(expected_survival ~ ., 
                  data = train_pca,
                  method = "lm",
                  trControl = k_10)

# Print Model Performance
print(model_cv)

Results

  • PCA was performed using SVD.
  • PC1 captures 40.80% of the variance in the data.
    • PC1 and PC2 capture 50.27% of the variance.
    • Over 90% of the information in the dataset can be explained by the first eleven PCs.
  • The variables which contribute the most to PC1 are
    • expected_hospital_readmission
    • expected_transfusion
    • expected_hospitalization
  • PC2 has large contributions from variables measuring phosphorus.
  • PCR was performed with expected_survival used as the response.
  • Estimates and significance of each PC regressor have key differences.
    • For example, PC3 is not a significant regressor while PC4 is.
  • Both models produced an \(R^2\) above 96% and a predicted \(R^2\) above 95%, with the cross-validated model about 1% higher.

Discussion

  • Principal Component Analysis is an essential multivariate analysis technique.
  • Covariance matrix → SVD → principal components.
  • Data should be scaled before analysis.
  • Sensitive to outliers.
  • Loss of interpretability in the transformed features.
  • Some loss of information when components are discarded.

PCA is definitely a useful tool to have in your toolkit!

Questions and Comments

References

  1. M. Ringnér, “What is principal component analysis?” Nature biotechnology, vol. 26, no. 3, pp. 303–304, 2008.
  2. I. T. Jolliffe and J. Cadima, “Principal component analysis: A review and recent developments,” Philosophical transactions of the royal society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016.
  3. B. M. S. Hasan and A. M. Abdulazeez, “A review of principal component analysis algorithm for dimensionality reduction,” Journal of Soft Computing and Data Mining, vol. 2, no. 1, pp. 20–30, 2021.
  4. B. Everitt and T. Hothorn, An introduction to applied multivariate analysis with R. Springer Science & Business Media, 2011.
  5. M. Greenacre, P. J. Groenen, T. Hastie, A. I. d’Enza, A. Markos, and E. Tuzhilina, “Principal component analysis,” Nature Reviews Methods Primers, vol. 2, no. 1, p. 100, 2022.
  6. K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin philosophical magazine and journal of science, vol. 2, no. 11, pp. 559–572, 1901.
  7. R. A. Fisher and W. A. Mackenzie, “Studies in crop variation. II. The manurial response of different potato varieties,” The Journal of Agricultural Science, vol. 13, no. 3, pp. 311–320, 1923.
  8. H. Hotelling, “Analysis of a complex of statistical variables into principal components.” Journal of educational psychology, vol. 24, no. 6, p. 417, 1933.
  9. D. Esposito and F. Esposito, Introducing machine learning. Microsoft Press, 2020.
  10. M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  11. S. Zhang and M. Turk, “Eigenfaces,” Scholarpedia, vol. 3, no. 9, p. 4244, 2008.
  12. F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  13. J. Maindonald and J. Braun, Data analysis and graphics using R: An example-based approach, vol. 10. Cambridge University Press, 2006.
  14. J. Lever, M. Krzywinski, and N. Altman, “Points of significance: Principal component analysis,” Nature methods, vol. 14, no. 7, pp. 641–643, 2017.
  15. F. L. Gewers et al., “Principal component analysis: A natural approach to data exploration,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–34, 2021.
  16. J. Hopcroft and R. Kannan, Foundations of data science. 2014.
  17. “Quarterly dialysis facility care compare (QDFCC) report: July 2023.” Centers for Medicare & Medicaid Services (CMS). Available: https://data.cms.gov/provider-data/dataset/2fpu-cgbb. [Accessed: Oct. 11, 2023]
  18. R Core Team, “Prcomp, a function of r: A language and environment for statistical computing.” R Foundation for Statistical Computing, Vienna, Austria, 2023. Available: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp. [Accessed: Oct. 16, 2023]
  19. S. R. Bennett, “Linear algebra for data science.” 2021. Available: https://shainarace.github.io/LinearAlgebra/index.html. [Accessed: Oct. 16, 2023]
  20. D. G. Luenberger, Optimization by vector space methods. John Wiley & Sons, 1997.
  21. S. Nash Warwick and W. Ford, “Abalone.” UCI Machine Learning Repository, 1995.
  22. J. Pagès, Multiple factor analysis by example using R. CRC Press, 2014.
  23. E. K. CS, “PCA problem / how to compute principal components / KTU machine learning.” YouTube, 2020. Available: https://youtu.be/MLaJbA82nzk. [Accessed: Nov. 01, 2023]
  24. F. Chumney, “PCA, EFA, CFA,” pp. 2–3, 6, Sep., 2012, Available: https://www.westga.edu/academics/research/vrc/assets/docs/PCA-EFA-CFA_EssayChumney_09282012.pdf
  25. H. Abdi and L. J. Williams, “Principal component analysis,” WIREs Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010, doi: https://doi.org/10.1002/wics.101. Available: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wics.101
  26. R Core Team, “Lm: Fitting linear models.” R Foundation for Statistical Computing, Vienna, Austria, 2023. Available: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm. [Accessed: Nov. 08, 2023]
  27. M. Kuhn, “Building predictive models in R using the caret package,” Journal of Statistical Software, vol. 28, no. 5, pp. 1–26, 2008, doi: 10.18637/jss.v028.i05. Available: https://www.jstatsoft.org/index.php/jss/article/view/v028i05
  28. R. Bro and A. K. Smilde, “Principal component analysis,” Analytical methods, vol. 6, no. 9, pp. 2812–2831, 2014.